A model of Poissonian interactions and detection of dependence
This paper proposes a model of interactions between two point processes, governed by a reproduction function h that is viewed as the intensity of a Poisson process. In particular, we focus on the neuroscience context of detecting possible interactions in the cerebral activity associated with two neurons. To give a mathematical answer to this specific problem raised by neurobiologists, we address the question of testing whether the intensity h is null. We construct a multiple testing procedure obtained by aggregating single tests based on a wavelet thresholding method. This test has good theoretical properties: under some assumptions, both its level and its power can be guaranteed, and its uniform separation rate over weak Besov bodies is adaptive minimax. Simulations illustrating the good practical behavior of our testing procedure are also provided.
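The paper's wavelet-aggregated test is not reproduced here; as a much cruder stand-in, the sketch below illustrates the underlying detection problem by comparing observed delayed coincidences between two simulated spike trains with the count that independence would predict (all parameters — the window w, the rates, the lag — are invented for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
T, rate = 100.0, 2.0
# "parent" neuron: homogeneous Poisson spike train on [0, T]
parent = np.sort(rng.uniform(0, T, rng.poisson(rate * T)))
# "child" neuron: fires 0.05s after roughly half of the parent spikes,
# so the two trains are dependent by construction
child = np.sort(parent[rng.random(len(parent)) < 0.5] + 0.05)

w = 0.1  # coincidence window
# observed count: parent spikes followed by a child spike within w
coinc = sum(np.any((child > s) & (child <= s + w)) for s in parent)
# rough expected count if the child train were an independent Poisson process
expected = len(parent) * (1 - np.exp(-(len(child) / T) * w))
print(coinc, expected)
```

A large excess of observed coincidences over the independence prediction signals dependence; the paper's procedure replaces this heuristic with a calibrated, adaptive multiple test.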
Variable selection using Random Forests
Focusing on random forests, the increasingly used statistical method for classification and regression problems introduced by Leo Breiman in 2001, this paper investigates two classical issues of variable selection. The first is to find important variables for interpretation; the second, more restrictive, is to design a good parsimonious prediction model. The main contribution is twofold: to provide some experimental insights into the behavior of the random forests variable importance index, and to propose a strategy combining a ranking of explanatory variables by the random forests importance score with a stepwise ascending variable introduction strategy.
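The first step of the strategy described above — ranking variables by a random forests importance index — can be sketched with scikit-learn as a stand-in for Breiman's implementation (the dataset and the impurity-based importance used here are illustrative, not the paper's experimental setup):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# toy data: 10 variables, only 3 of which are informative
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# rank the explanatory variables, most important first
ranking = np.argsort(rf.feature_importances_)[::-1]
print("variables ranked by importance:", ranking.tolist())
```

The paper then builds on such a ranking with a stepwise ascending introduction of variables.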
Random Forests for Big Data
Big Data is one of the major challenges of statistical science and has
numerous consequences from algorithmic and theoretical viewpoints. Big Data
always involves massive data, but it also often includes online data and data
heterogeneity. Recently some statistical methods have been adapted to process
Big Data, like linear regression models, clustering methods and bootstrapping
schemes. Based on decision trees combined with aggregation and bootstrap ideas,
random forests were introduced by Breiman in 2001. They are a powerful
nonparametric statistical method that handles, in a single and versatile
framework, regression problems as well as two-class and multi-class
classification problems. Focusing on classification problems, this paper
proposes a selective review of available proposals that deal with scaling
random forests to Big Data problems. These proposals rely on parallel
environments or on online adaptations of random forests. We also describe how
related quantities -- such as out-of-bag error and variable importance -- are
addressed in these methods. Then, we formulate various remarks on random
forests in the Big Data context. Finally, we experiment with five variants on two
massive datasets (15 and 120 million observations), one simulated and one from
the real world. One variant relies on subsampling, while three others are
related to parallel implementations of random forests and involve either
various adaptations of the bootstrap to Big Data or "divide-and-conquer"
approaches. The fifth variant relies on online learning of random forests.
These numerical experiments highlight the relative performance of the
different variants, as well as some of their limitations.
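The subsampling variant mentioned above has a simple core idea: grow the forest on a random subsample of size m much smaller than N, then evaluate on the full data. A hedged sketch with scikit-learn follows (the sizes here are illustrative, nowhere near the paper's 15M/120M observations):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# stand-in for a massive dataset
X, y = make_classification(n_samples=20000, n_features=20, random_state=1)

m = 2000  # subsample size m << N
idx = np.random.default_rng(1).choice(len(X), size=m, replace=False)

# the forest is built on the subsample only
rf_sub = RandomForestClassifier(n_estimators=100, random_state=1)
rf_sub.fit(X[idx], y[idx])

# ...but evaluated against the full dataset
acc = rf_sub.score(X, y)
print(f"accuracy of subsampled forest on full data: {acc:.3f}")
```

The parallel and online variants reviewed in the paper trade this simplicity for better use of all N observations.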
VSURF : un package R pour la sélection de variables à l'aide de forêts aléatoires
This paper describes the R package VSURF. Based on random forests, it delivers two subsets of variables according to two types of variable selection, for classification or regression problems. The first is a subset of important variables which are relevant for interpretation, while the second is a subset corresponding to a parsimonious prediction model. The strategy is based on a preliminary ranking of the explanatory variables using the random forests permutation-based importance score, and proceeds with a stepwise ascending variable introduction strategy. The two proposals can be obtained automatically using data-driven default values, good enough to provide interesting results, but can also be fine-tuned by the user. The algorithm is illustrated on a simulated example and its applications to real datasets are presented.
In this presentation, we describe VSURF, an R package. Based on random forests, it provides two subsets of variables associated with two variable selection objectives for regression and classification problems. The first is a subset of variables important for interpretation. The second is a parsimonious subset from which good predictions can be made. The general strategy rests on a preliminary ranking of the variables given by the random forests importance index, then uses a stepwise ascending variable introduction algorithm. Both subsets can be obtained automatically by keeping the package's default behavior, but can also be tuned through several parameters. We illustrate the method on several real datasets.
Variable selection through CART
This paper deals with variable selection in the regression and binary classification frameworks. It proposes an automatic and exhaustive procedure which relies on the CART algorithm and on model selection via penalization. This work, of a theoretical nature, aims at determining adequate penalties, i.e. penalties that yield oracle-type inequalities justifying the performance of the proposed procedure. Since the exhaustive procedure cannot be executed when the number of variables is too large, a more practical procedure, still theoretically validated, is also proposed. A simulation study completes the theoretical results.
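The penalization idea above can be illustrated with CART's cost-complexity pruning in scikit-learn, where each value of alpha penalizes tree size and a validation criterion selects the subtree (this is a generic sketch; the paper's theoretical penalties and oracle inequalities are not those of this toy selection):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=400, n_features=8, noise=5.0, random_state=2)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=2)

# candidate penalties: the effective alphas along the pruning path
path = DecisionTreeRegressor(random_state=2).cost_complexity_pruning_path(X_tr, y_tr)

# score each penalized subtree on held-out data and keep the best penalty
scores = [DecisionTreeRegressor(random_state=2, ccp_alpha=a)
          .fit(X_tr, y_tr).score(X_val, y_val) for a in path.ccp_alphas]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]
print("selected penalty (ccp_alpha):", best_alpha)
```

The paper's contribution is to justify such penalty choices theoretically rather than by held-out validation.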
Inference of functional connectivity in Neurosciences via Hawkes processes
1st IEEE Global Conference on Signal and Information Processing, 3-5 Dec. 2013, Austin (USA). We use Hawkes processes as models for spike train analysis. A new Lasso method designed for general multivariate counting processes enables us to estimate the functional connectivity graph between the different recorded neurons.
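For readers unfamiliar with Hawkes processes, the conditional intensity that makes them natural spike-train models has the form lambda(t) = mu + sum over past spikes t_i < t of h(t - t_i); the minimal sketch below evaluates it with an exponential kernel (the parameters and spike times are made up for illustration, and the paper's Lasso estimator of the interaction functions is not implemented here):

```python
import numpy as np

mu, alpha, beta = 0.5, 0.8, 2.0        # baseline rate and kernel parameters (illustrative)
spikes = np.array([0.3, 0.9, 1.1])     # observed spike times of one neuron

def intensity(t, spikes):
    # lambda(t) = mu + sum_{t_i < t} alpha * exp(-beta * (t - t_i))
    past = spikes[spikes < t]
    return mu + np.sum(alpha * np.exp(-beta * (t - past)))

print(intensity(1.5, spikes))  # each past spike transiently raises the firing rate
```

In the multivariate setting of the paper, each neuron gets one such kernel per potentially connected neuron, and the Lasso sets the negligible ones to zero, revealing the connectivity graph.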
VSURF: An R Package for Variable Selection Using Random Forests
This paper describes the R package VSURF. Based on random forests, and for both regression and classification problems, it returns two subsets of variables. The first is a subset of important variables, including some redundancy, which can be relevant for interpretation; the second is a smaller subset corresponding to a model that tries to avoid redundancy and focuses more closely on the prediction objective. The two-stage strategy is based on a preliminary ranking of the explanatory variables using the random forests permutation-based importance score, and proceeds with a stepwise forward strategy for variable introduction. The two proposals can be obtained automatically using data-driven default values, good enough to provide interesting results, but can also be tuned by the user. The algorithm is illustrated on a simulated example and its applications to real datasets are presented.
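Since VSURF itself is an R package, the two-subset idea can only be mimicked here; the following Python sketch uses scikit-learn's permutation importance and a greedy forward pass (the mean-importance threshold and cross-validation criterion are illustrative stand-ins, not VSURF's data-driven defaults):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=12, n_informative=4,
                           n_redundant=3, random_state=3)
rf = RandomForestClassifier(n_estimators=200, random_state=3).fit(X, y)
imp = permutation_importance(rf, X, y, n_repeats=10,
                             random_state=3).importances_mean

# "Interpretation" set: every variable clearing an importance threshold,
# redundancy included (threshold here: the mean importance, an assumption)
interp = [j for j in np.argsort(imp)[::-1] if imp[j] > imp.mean()]

# "Prediction" set: greedy forward pass over the interpretation set,
# keeping a variable only if it improves cross-validated accuracy
pred, best = [], 0.0
for j in interp:
    s = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=3),
                        X[:, pred + [j]], y, cv=3).mean()
    if s > best:
        pred, best = pred + [j], s

print("interpretation set:", interp)
print("prediction set:", pred)
```

As in VSURF, the prediction set is a subset of the interpretation set, trading completeness for parsimony.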